{
 "metadata": {
  "language": "Julia",
  "name": "",
  "signature": "sha256:fb963d1687047d692b75fe6095ece0da69483e9055db9c34a86e4e4cf7054dbf"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Ordering a pedigree by LAP"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We represent a pedigree with two integer vectors, `sire` and `dam`, of the same length, say `n`.  The elements of these vectors must be in the range `0:n` with `0` representing an unknown parent.\n",
      "\n",
      "For each of the `n` animals the _longest ancestral path_ (LAP) is the maximum length of a path that can be traced from the animal through its ancestors.  If both `sire[i]` and `dam[i]` are zero then the longest ancestral path for animal `i` is zero.\n",
      "\n",
      "In general we want a pedigree to be sorted so that the parents of an animal have lower indices than the animal itself.  If it is possible to sort the pedigree so that all the animals with LAP = 0 occur first followed by all the animals with LAP = 1 followed by all the animals with LAP = 2, etc. then we are guaranteed that the parents occur before the offspring.  (The LAP of the offspring is always greater than the LAP of both of its parents.)"
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Outline of the algorithm"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The algorithm is straightforward but, as always, getting the details right can be a bit tedious.  The function `laporder` takes the `sire` and `dam` vectors and returns the re-ordering permutation, `ord`, the indices of the last element in the permuted ordering before a change in the LAP, `lappt`, and modified sire and dam vectors, `ss` and `dd`, under the new ordering.\n",
      "\n",
      "Those animals with LAP = 0 are easy to identify because both `sire` and `dam` must be zero.  Their indices are the leading part of the `ord` vector.  The convention for the `lappt` vector is that the first element is always zero so the second element is the number of animals in the LAP0 group.  The set of possible ancestors at the next stage is the union of `[0]` and the LAP0 group.\n",
      "\n",
      "The algorithm proceeds iteratively from the LAP0 group to the LAP1 group to the LAP2 group and so on. At each stage the next LAP group is determined according to both the sire and dam being in the current ancestor, `anc`, set."
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Using `IntSet`s in `Julia`"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The [Julia](http://julialang.org) language provides an `Intset` data type to represent a set of integers as a string of bits.  Operations like union, intersection and set difference are expressed as logical operations on bits in a register, making this type very efficient.  In the example shown below there are 6547 animals, sets of which are represented as a vector of 103 64-bit integers.\n",
      "\n",
      "Many Julia functions are _mutating_ functions, meaning that they modify one or more of their arguments.  The iteration involves three IntSets, the current set of ancestors, `anc`, the current unassigned population, `pop`, and the next generation, `nextgen`.  The update step is to detect and record the next generation then move those indices from `pop` to `anc`.  The function source code is in the file [`types.jl`](https://github.com/Rpedigree/pedigree.jl/blob/master/src/types.jl).  Notice that once `nextgen` has been detected and recorded, the rest of the update consists of\n",
      "```jl\n",
      "        union!(anc,nextgen)\n",
      "        setdiff!(pop,nextgen)\n",
      "        empty!(nextgen)\n",
      "```\n",
      "\n",
      "An interesting characteristic of the algorithm is that there is essentially no storage allocated during the iterative loop.  Elements are pushed onto the end of the `ord` and `lappt` vectors but those have been overallocated to allow for the additional elements."
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "An Example"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The [`pedigreeR`](https://github.com/Rpedigree/pedigreeR) package for [R](http://www.R-project.org) contains a dataset named `pedCows`. After suitable conversion of the `NA`'s to zeros, the `sire` and `dam` vectors are stored in an HDF5 file as shown below."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "using pedigree,HDF5\n",
      "fid = h5open(Pkg.dir(\"pedigree\",\"data\",\"pedCows.h5\"))\n",
      "sire = readmmap(fid[\"sire\"])\n",
      "dam = readmmap(fid[\"dam\"])\n",
      "close(fid)\n",
      "dump(sire)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Array(Int16,(6547,)) Int16"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[0,0,0,0,0,0,0,0,0,0  \u2026  1630,1630,1630,1630,1630,1630,2793,2793,2793,1630]\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Because the arrays `sire` and `dam` will be read-only we memory-map them, which provides more efficient storage.  For this size of pedigree this isn't important but for very large data sets memory mapping can be very effective.\n",
      "\n",
      "Finally we call the ordering function."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ord,lappt,ss,dd = laporder(sire,dam)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 2,
       "text": [
        "([1,2,3,4,5,6,7,8,9,10  \u2026  6439,6472,6476,6477,6484,6485,4951,5797,5877,6022],[0,1866,2277,2606,3227,4124,4851,5860,6438,6543,6547],Int16[0,0,0,0,0,0,0,0,0,0  \u2026  3541,4209,5865,5865,3541,3541,4858,4430,4428,4209],Int16[0,0,0,0,0,0,0,0,0,0  \u2026  5917,5920,4710,3917,5916,5919,6442,6440,6441,6439])"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This is very fast and uses only a small amount of memory in total."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "@time laporder(sire,dam);"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "elapsed time: 0."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "003499049 seconds (568952 bytes allocated)\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can check that the resulting pedigree of `ss` and `dd` is indeed sorted by LAP order."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "p = Pedigree(ss,dd)\n",
      "dump(p.lap)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Array(Int16,(6547,)) Int16[0,0,0,0,0,0,0,0,0,0  \u2026  8,8,8,8,8,8,9,9,9,9]\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The constructor for the `Pedigree` type currently performs several consistency checks and evaluates the `lap` member separately.  The differences in the `lappt` elements are the number of animals in each group."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "show(diff(lappt))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[1866"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ",411,329,621,897,727,1009,578,105,4]"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The reason for storing the `lappt` vector in the form that is used here is so that the LAP group of any animal can be easily determined."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}